{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "data = pd.read_csv(\"stars.csv\")\n",
    "type_key = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant','Hypergiant']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6 - Testing for normality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Professor Xu thinks that the current system for classifying stars can be improved. In particular, she thinks that log(temperature) should be normally distributed for each star type. \n",
    "\n",
    "She has asked you to find out whether this is true under the current classification system, and if not, which types should be revised."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question: for which star types is log(temperature) normally distributed?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Q-Q plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Theory\n",
    "\n",
    "We can use a Q-Q plot to investigate whether each sample resembles a normal distribution.\n",
    "\n",
    "If we set\n",
    "\n",
    "$x =$ the theoretical quantiles from the standard normal ($Z$) distribution, and\n",
    "\n",
    "$y =$ the quantiles from the sample,\n",
    "\n",
    "the Q-Q plot will be close to a straight line for samples that are approximately normally distributed.\n",
    "\n",
    "Although this is not a rigorous statistical method, it can be enough to suggest whether a normal approximation is likely to be valid for a particular data set, or to diagnose [*skewness*](https://en.wikipedia.org/wiki/Skewness) and/or [*kurtosis*](https://en.wikipedia.org/wiki/Kurtosis) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def do_plots(sample, col='gray', lab=None, mu=None, sigma=None):\n",
    "    \n",
    "    if(mu):\n",
    "        n = len(sample)\n",
    "        norm_fit = stats.norm(loc=mu, scale=sigma)\n",
    "  \n",
    "    f = plt.figure(figsize=(12, 4))\n",
    "    \n",
    "    # histogram\n",
    "    ax = plt.subplot(121)\n",
    "    \n",
    "    nbins = 20\n",
    "    smin = sample.min()\n",
    "    smax = sample.max()\n",
    "    binwidth = (smax - smin)/nbins\n",
    "    bins = np.linspace(sample.min(), sample.max(), nbins)\n",
    "    \n",
    "    xlab = 'observed'\n",
    "    ylab = 'freq'\n",
    "    ax.set_xlabel(xlab)\n",
    "    ax.set_ylabel(ylab)\n",
    "    \n",
    "    ax.hist(sample, bins, alpha=0.5, color=col, label=lab)\n",
    "    \n",
    "    if(mu):\n",
    "        plt.plot(bins, n * binwidth * norm_fit.pdf(bins), color='black')\n",
    "    \n",
    "    if(lab):\n",
    "        ax.legend(loc='upper left')\n",
    "    \n",
    "    # Q-Q plot\n",
    "    \n",
    "    ax = plt.subplot(122)\n",
    "    \n",
    "    x = np.linspace(0,1,100)\n",
    "    normal_q = stats.norm.ppf(x)\n",
    "    sample_q = np.array(np.quantile(sample,x))\n",
    "    \n",
    "    xlab = 'standard normal'\n",
    "    ylab = 'observed'\n",
    "    ax.set_xlabel(xlab)\n",
    "    ax.set_ylabel(ylab)\n",
    "    \n",
    "    # the plot itself:\n",
    "    ax.scatter(normal_q,sample_q, color=col, label=lab)\n",
    "    \n",
    "    # the line passing through Q25 and Q75:\n",
    "    m = (sample_q[75] - sample_q[25])/(normal_q[75] - normal_q[25])\n",
    "    c = sample_q[25] - normal_q[25] * m\n",
    "    ax.plot(normal_q, m * normal_q + c, color='black')\n",
    "    \n",
    "    if(lab):\n",
    "        ax.legend(loc='upper left')\n",
    "    \n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### normally distributed data\n",
    "sample = stats.norm.rvs(loc=10,scale=2,size=100)\n",
    "\n",
    "### positively skewed data (e.g. lognormal)\n",
    "#sample = stats.lognorm.rvs(0.5,size=100)\n",
    "\n",
    "### negatively skewed data (e.g. constant - lognormal)\n",
    "#sample = 20 - stats.lognorm.rvs(0.5,size=100)\n",
    "\n",
    "### leptokurtic (heavy-tailed) data (e.g. Student's t-distribution with df=2)\n",
    "#sample = stats.t.rvs(2,loc=200,size=100)\n",
    "\n",
    "### platykurtic (thin-tailed) data (e.g. uniform distribution)\n",
    "#sample = stats.uniform.rvs(loc=10,scale=20,size=100)\n",
    "\n",
    "do_plots(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can also be a useful way to identify the data points that are responsible for any deviations from normality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Application\n",
    "\n",
    "Run the code below to see the histograms and Q-Q plots for log(temperature) for each star type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Set the star type to test\n",
    "t = 0\n",
    "\n",
    "sample = data[data.type == t].temperature.apply(np.log)\n",
    "col= 'C' + str(t)\n",
    "lab= type_key[t]\n",
    "\n",
    "mu = sample.mean()\n",
    "sigma = sample.std()\n",
    "\n",
    "do_plots(sample,col,lab,mu,sigma)\n",
    "print(type_key[t])\n",
    "print('mean:', mu, '   SD:', sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For which of the star types does log(temperature) appear approximately normal? How would you describe the other distributions?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, for a more rigorous investigation of normality, we can use a statistical test:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shapiro-Wilk test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Theory\n",
    "\n",
    "The [*Shapiro–Wilk test*](https://en.wikipedia.org/wiki/Shapiro–Wilk_test) tests the null hypothesis that a sample $x_1, ..., x_n$ came from a normally distributed population.\n",
    "\n",
    "It compares statistics obtained from the observed data to the expected values of statistics sampled from the standard normal distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Application"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Shapiro-Wilk test is easy to apply in scipy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### set the star type to test\n",
    "t = 0\n",
    "\n",
    "sample = data[data.type == t].temperature.apply(np.log)\n",
    "p_value = stats.shapiro(sample)[1]\n",
    "\n",
    "print(type_key[t])\n",
    "print('p =', p_value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply the Shapiro-Wilk test to each of the star types. Which of them produce p-values less than $\\alpha$? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Alternative tests for normality\n",
    "\n",
    "Many other tests for normality have been developed, including the Anderson-Darling, Cramér–von Mises and Kolmogorov-Smirnov (see below) tests. \n",
    "\n",
    "The Shapiro-Wilk test has been found to have the best statistical power for a given significance level.\n",
    "\n",
    "<br>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
